Beyond Markovian: Bayes-Adaptive RL for Reflective Exploration in LLMs
BARL: More Efficient and Adaptive Reasoning in Large Language Models
Authors: S. Zhang et al.
Published on Arxiv: 2025-05-26
Link: http://arxiv.org/abs/2505.20561v1
Institutions: Northwestern University • Google DeepMind • Google
Keywords: large language models, reinforcement learning, Bayes-adaptive RL, reflective exploration, reasoning, Markov decision process, token efficiency, mathematical reasoning, Bayesian inference, self-reflection, generalization, GRPO, progress reward, Chain-of-Thought, Qwen2.5-Math, DeepSeek-R1-Distill-Llama, GSM8K, MATH, CollegeMath, OlympiadBench
Large Language Models (LLMs) have demonstrated notable reasoning abilities and emergent reflective behaviors, especially when fine-tuned using reinforcement learning (RL). However, traditional Markovian RL frameworks limit exploration to the training phase and restrict policies to only current-state information, neglecting the potential benefits of leveraging history for reflective reasoning during inference. This gap underlines the necessity to optimize and better understand reflective exploration for improved generalization and reasoning efficiency in LLMs.
Building on this context, the authors propose a new Bayes-Adaptive RL (BARL) approach, introducing novel methodologies and contributions:
- Reframe reflective exploration in LLMs within the Bayes-Adaptive RL (BARL) framework, which incentivizes both exploitation and epistemic exploration through belief updates and posterior distributions over possible MDPs.
- Introduce the BARL algorithm, enabling LLMs to generate multiple solution candidates, maintain a posterior over plausible MDPs, and adaptively switch or combine strategies through Bayesian inference based on observed outcomes.
- Conduct experimental evaluations on both synthetic tasks (such as token repetition problems) and mathematical reasoning benchmarks (GSM8K, MATH, CollegeMath, OlympiadBench) using LLMs like Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and R1-Distill-Llama-8B, comparing BARL to state-of-the-art Markovian RL approaches including GRPO and progress-reward baselines.
Moving from the methodology to results, the evaluation demonstrates significant improvements with BARL over traditional methods:
- BARL consistently surpasses standard Markovian RL baselines across all tested benchmarks and models, achieving higher accuracy and superior token efficiency—up to 39% fewer tokens than the progress method, 50% fewer than GRPO, and over 90% fewer than the base model.
- The advantages of BARL are especially significant on complex benchmarks requiring exploratory reasoning (e.g., CollegeMath, OlympiadBench), and are attributed to more efficient exploration and optimal use of thinking tokens, rather than simply increasing the frequency of reflective actions.
- BARL equips LLMs with the ability to adaptively shift strategies in response to discrepancies between learned beliefs and the observed rewards.
Taken together, the results motivate strong conclusions about the benefits of BARL, which are summarized as follows:
- BARL establishes a principled Bayes-adaptive RL framework for guiding LLMs on reflective exploration, promoting more efficient and adaptive reasoning behaviors.
- In contrast to Markovian RL, which often leads to simple memorization and fails to ensure the emergence of reflection, BARL significantly enhances generalization and supports effective exploration at inference.
- Future work aims to extend the versatility of BARL towards broader domains such as code generation and intelligent agent tasks.